Mixed models are motivated by the need to account for nuisance variance in the population we sampled as part of our experiment. A major source of this nuisance variance is non-independence among observations, which we have already encountered in the split-plot design and repeated measures experiments.
Mixed models are known by a variety of names that are usually sub-field specific. The most common terms you may come across are:

* Variance components
* Random intercepts and slopes
* Random effects
* Varying coefficients
* Intercepts- and/or slopes-as-outcomes
* Hierarchical linear models
* Multilevel models (implies multiple levels of hierarchically clustered data)
* Growth curve models (possibly Latent GCM)
* Mixed effects models
We will use the term mixed effects models, which refers to models that include both fixed and random effects. This terminology does not imply a specific structure (e.g., hierarchical or multilevel). Fixed effects are the non-random terms in the model, including the variables of experimental interest. Random effects are typically grouping factors that define the clustering of observational units.
The simplest and most common case of a mixed effects model is where there is a single grouping or cluster structure for the random effect. This is known as a random intercept model.
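To make the distinction concrete, here is a minimal sketch of a random intercept model in lme4 syntax. The data are simulated for illustration (the grouping factor `g`, predictor `x`, and effect sizes are all made up, not from the course examples):

```r
# A minimal random-intercept sketch with simulated data:
# 6 groups (e.g., blocks), each shifting the intercept up or down
library(lme4)

set.seed(42)
g <- factor(rep(1:6, each = 10))    # grouping factor defining the clusters
x <- rnorm(60)                      # predictor of experimental interest
b <- rnorm(6, sd = 2)               # one intercept shift per group
y <- 2 + 1.5 * x + b[g] + rnorm(60) # response with group-level non-independence

m_fixed <- lm(y ~ x)                # fixed effects only: grouping ignored
m_mixed <- lmer(y ~ x + (1 | g))    # fixed slope for x, random intercept per group
VarCorr(m_mixed)                    # variance attributed to the grouping factor
```

The `(1 | g)` term is what makes this a mixed model: `x` is the fixed effect of interest, while the group-to-group intercept variation is summarized by a single variance component.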
The first example comes from the InsectSprays data set, where multiple pesticide sprays are used and the count of insects is observed.
## Rows: 48 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): spray
## dbl (1): count
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## spc_tbl_ [48 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ count: num [1:48] 10 7 20 14 14 12 10 23 17 20 ...
## $ spray: chr [1:48] "A" "A" "A" "A" ...
## - attr(*, "spec")=
## .. cols(
## .. count = col_double(),
## .. spray = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
ggplot(data, aes(x = spray, y = count)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(height = 0, width = 0.1)
We test whether there is a relationship between the spray used and the count of insects observed. We can add blocks to the data by constructing a factor variable: the code builds 12 blocks, each comprising 4 replicates, for a total of 48 samples. In the block-faceted plot further below, each boxplot collapses to a point and line because each block contains a single observation per spray.
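The chunk that constructs the block factor is not shown in the original. A hypothetical reconstruction consistent with the `glimpse()` output below (assuming `data` is ordered by spray, as in the earlier output) would be:

```r
# Hypothetical reconstruction - the original chunk was not shown
# 12 blocks repeated within each of the 4 sprays (48 rows total)
data <- data %>%
  mutate(block = factor(rep(1:12, times = 4)))
```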
## Rows: 48
## Columns: 3
## $ count <dbl> 10, 7, 20, 14, 14, 12, 10, 23, 17, 20, 14, 13, 11, 17, 21, 11, 1…
## $ spray <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B",…
## $ block <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9…
ggplot(data, aes(x = spray, y = count)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(height = 0,width = 0.1) +
facet_wrap(~ block) # 12 blocks
The initial model considers the variation in count among the different sprays and ignores the blocking design.
# Running the Model ANOVA - ignoring the block design
Mod <- glm(count ~ spray, data = data)
car::Anova(Mod, test.statistic = "F")
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| spray | 1648.729 | 3 | 26.47836 | 0 |
| Residuals | 913.250 | 44 | NA | NA |
We can account for the blocking factor by adding the block variable as a fixed effect.
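The chunk that produced the table below is not shown in the original. Based on how `Mod2` is used later with emmeans, a plausible reconstruction is:

```r
# Hypothetical reconstruction - the original chunk was not shown
Mod2 <- glm(count ~ spray + block, data = data) # block added as a fixed effect
car::Anova(Mod2, test.statistic = "F")
```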
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| spray | 1648.7292 | 3 | 32.529764 | 0.0000000 |
| block | 355.7292 | 11 | 1.914166 | 0.0737824 |
| Residuals | 557.5208 | 33 | NA | NA |
Often we aren't interested in testing significance, or estimating parameters, for the block effect - we just want to account for it. Block may therefore be more appropriately fitted as a random effect. This is called a linear mixed effects model with block as a random effect (a random intercept). A random intercept is specified using the (1|group) notation - here, (1|block). This notation is used for both the glmmTMB and lmer functions.
Mod3 <- lmer(count ~ spray + (1|block), data = data)
# Mod3 <- glmmTMB(count ~ spray + (1|block), data = data)
# anova(Mod) - no anova() method for glmmTMB
car::Anova(Mod3, test.statistic = "F")
| | F | Df | Df.res | Pr(>F) |
|---|---|---|---|---|
| spray | 32.52976 | 3 | 33 | 0 |
## Linear mixed model fit by REML ['lmerMod']
## Formula: count ~ spray + (1 | block)
## Data: data
##
## REML criterion at convergence: 266.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.95239 -0.58269 -0.07399 0.60466 1.99778
##
## Random effects:
## Groups Name Variance Std.Dev.
## block (Intercept) 3.861 1.965
## Residual 16.895 4.110
## Number of obs: 48, groups: block, 12
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 14.5000 1.3152 11.025
## sprayB 0.8333 1.6780 0.497
## sprayC -12.4167 1.6780 -7.400
## sprayF 2.1667 1.6780 1.291
##
## Correlation of Fixed Effects:
## (Intr) sprayB sprayC
## sprayB -0.638
## sprayC -0.638 0.500
## sprayF -0.638 0.500 0.500
Now we can look at the diagnostics from both the fixed effects and mixed effects models.
## Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help.
##
## Scaled residual values: 0.484 0.148 0.64 0.476 0.488 0.324 0.276 0.956 0.332 0.636 0.512 0.484 0.496 0.916 0.62 0.196 0.552 0.488 0.844 0.584 ...
## Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help.
##
## Scaled residual values: 0.16 0.064 0.896 0.46 0.388 0.296 0.136 0.96 0.736 0.896 0.456 0.384 0.192 0.648 0.876 0.216 0.548 0.412 0.612 0.652 ...
Finally we can compare the estimated marginal means.
emmeans2 <- Mod2 %>%
emmeans(specs = "spray", type = "response") %>%
cld(Letters = letters)
emmeans2
| | spray | emmean | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|---|
| 3 | C | 2.083333 | 1.186542 | 33 | -0.3307036 | 4.49737 | a |
| 1 | A | 14.500000 | 1.186542 | 33 | 12.0859630 | 16.91404 | b |
| 2 | B | 15.333333 | 1.186542 | 33 | 12.9192964 | 17.74737 | b |
| 4 | F | 16.666667 | 1.186542 | 33 | 14.2526297 | 19.08070 | b |
emmeans3 <- Mod3 %>%
emmeans(specs = "spray", type = "response") %>%
cld(Letters = letters)
emmeans3
| | spray | emmean | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|---|
| 3 | C | 2.083333 | 1.315158 | 39.86165 | -0.5749872 | 4.741654 | a |
| 1 | A | 14.500000 | 1.315158 | 39.86165 | 11.8416794 | 17.158321 | b |
| 2 | B | 15.333333 | 1.315158 | 39.86165 | 12.6750128 | 17.991654 | b |
| 4 | F | 16.666667 | 1.315158 | 39.86165 | 14.0083461 | 19.324987 | b |
emmeans2 <- emmeans2 %>%
as_tibble()
emmeans3 <- emmeans3 %>%
as_tibble()
Plot_Mod2 <- ggplot() +
geom_point(data = data, aes(y = count, x = spray), position = position_jitter(width = 0.1)) + # dots representing the raw data
geom_boxplot(data = data, aes(y = count, x = spray), position = position_nudge(x = -0.25), width = 0.25, outlier.shape = NA) + # boxplot
geom_point(data = emmeans2, aes(y = emmean, x = spray), position = position_nudge(x = 0.15), size = 2,color = "red") + # red dots representing the adjusted means
geom_errorbar(data = emmeans2, aes(ymin = lower.CL, ymax = upper.CL, x = spray), position = position_nudge(x = 0.15), color = "red", width = 0.1) + # red error bars representing the confidence limits of the adjusted means
geom_text(data = emmeans2, aes(y = emmean, x = spray, label = str_trim(.group)), position = position_nudge(x = 0.25), color = "black", angle = 0) + # compact letter display
labs(y = "Counts", x = "Treatment") +
theme_classic()
Plot_Mod2
Plot_Mod3 <- ggplot() +
geom_point(data = data, aes(y = count, x = spray), position = position_jitter(width = 0.1)) + # dots representing the raw data
geom_boxplot(data = data, aes(y = count, x = spray), position = position_nudge(x = -0.25), width = 0.25, outlier.shape = NA) + # boxplot
geom_point(data = emmeans3, aes(y = emmean, x = spray), position = position_nudge(x = 0.15), size = 2, color = "red") + # red dots representing the adjusted means
geom_errorbar(data = emmeans3, aes(ymin = lower.CL, ymax = upper.CL, x = spray), position = position_nudge(x = 0.15), color = "red", width = 0.1) + # red error bars representing the confidence limits of the adjusted means
geom_text(data = emmeans3, aes(y = emmean, x = spray, label = str_trim(.group)), position = position_nudge(x = 0.25), color = "black", angle = 0) + # compact letter display
labs(y = "Counts", x = "Treatment") +
theme_classic()
Plot_Mod3
The fixed effects and mixed effects models produce similar estimates (with wider confidence intervals for the mixed effects model). However, by treating block as a random effect, we conceptualize the blocks as a sub-sample of a larger population of potential blocks, which allows us to extend conclusions from this model to other blocks from that larger population.
The question "should I treat factor XXX as fixed or random?" can be very complicated, with competing opinions and definitions. Some examples from the relevant literature are given below for context. You should think critically and find evidence to support your decisions.
Fixed effects are unknown constants that we wish to estimate from the model, and similar estimates could be expected in subsequent experimentation. The researcher is interested in these particular estimates.
Random effects are random variables sampled from a population, whose particular levels would not be observed again in subsequent experimentation. The researcher is not interested in these particular levels, but rather in how the estimates vary from sample to sample.
From Gelman 2005:

1. Fixed effects are constant across individuals, and random effects vary (e.g., random intercepts with fixed slopes result in parallel lines).
2. Effects are fixed if they are interesting in themselves, or random if there is interest in the underlying population.
3. When a sample exhausts the population, the corresponding variable is fixed; when the sample is a small (i.e., negligible) part of the population, the corresponding variable is random.
4. If an effect is assumed to be a realized value of a random variable, it is called a random effect.
5. Fixed effects are estimated using least squares (or, more generally, maximum likelihood) and random effects are estimated with shrinkage (e.g., linear unbiased prediction).
“…one modeler’s random effect is another modeler’s fixed effect” (Schabenberger and Pierce 2001).
One particularly contested idea concerns the minimum number of levels a random effect should have (often quoted as 5-6). This advice comes from a pragmatic concern - enough levels are needed to estimate the variance of the population - but a more philosophical view focuses on the purpose of the term as it is included in the model.
Another major difference with mixed effects models is that we can calculate the variance component of each random effect: how much variation there is among the intercepts of the levels of that random effect. In the InsectSprays example, if block had very little effect on insect counts (all blocks about the same), the variance component would be low (near zero). If there were a large amount of variation among blocks (some blocks had very few insects and some had a lot), the variance component would be high. Variance components are closely related to the coefficient of determination, or R-squared.
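As a quick sketch, the variance components reported in the `summary(Mod3)` output above (block variance 3.861, residual variance 16.895) can be turned into an intraclass correlation - the share of the total random variation that sits between blocks:

```r
# Intraclass correlation from the variance components reported by summary(Mod3)
var_block    <- 3.861  # block (Intercept) variance
var_residual <- 16.895 # residual variance
icc <- var_block / (var_block + var_residual)
round(icc, 3) # about 0.19: roughly 19% of the random variation is between blocks
```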
We know where the R-squared value is reported for a fixed effects model. A few extra steps are needed to extract an R-squared for a mixed effects model, or when we use the glmmTMB function. The fixed effects model gives us the R-squared and adjusted R-squared values. The mixed effects model gives us a conditional R-squared (fixed plus random effects) and a marginal R-squared (fixed effects only). Comparing the two tells us how much variance is attributed to the random effect relative to the whole model.
## R2: 0.782
## # R2 for Mixed Models
##
## Conditional R2: 0.697
## Marginal R2: 0.628
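The calls that produced the values above are not shown; the output format is consistent with the performance package (an assumption about the omitted chunk):

```r
# Hypothetical reconstruction - the original chunk was not shown
performance::r2(Mod2) # fixed effects model: R2
performance::r2(Mod3) # mixed model: conditional and marginal R2
```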
Mixed effects models become further complicated when we move beyond the simple random intercept model.
Mixed effects models can have one or multiple sources of clustering that require fitting different random effects to capture the variance attributed to those observational units. The structure of this clustering may be hierarchical or take other complicated forms, which can be a source of confusion for those new to mixed effects models. A classic example is students' GPAs observed multiple times (repeated observations nested within students) from multiple classrooms (students nested within classrooms) across multiple schools (classrooms nested within schools). Another example is the yield of a random selection of crop varieties tested at multiple locations.
Another pair of terms you may come across when discussing random effects is nested and crossed. These terms describe the data, and recognizing this characteristic is important when specifying the random effects terms in the model. Random effects are nested when each member of one group is contained entirely within a single unit of another group - think of the students-within-classrooms example above. Random effects are crossed when this nesting does not hold - think of crop varieties tested at multiple locations, where the same variety can be tested at multiple locations and each location can host multiple varieties.
This mainly comes down to how the levels of the clustered data are coded. As long as the levels of the nested variable are unique across the whole data set (as opposed to unique only within each level of the nesting variable), nested and crossed specifications are identical. Review this link for more explanation:
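A small base-R sketch of this coding point (with hypothetical labels, not from the course data):

```r
# Block labels reused within each rep: nesting must be declared explicitly
rep_id   <- rep(c("A", "B"), each = 6)
block_id <- rep(1:6, times = 2)            # label "1" appears in both reps
# With these labels, use (1|rep_id/block_id), i.e. (1|rep_id) + (1|rep_id:block_id)

# Globally unique labels: nested and crossed specifications become identical
block_uid <- interaction(rep_id, block_id) # "A.1", "B.1", ... are distinct units
length(unique(block_id))                   # 6 distinct labels across 12 blocks
length(unique(block_uid))                  # 12 distinct units -> (1|block_uid) suffices
```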
You may recognize this model structure from the examples used in the split-plot design lecture.
The data for this example is a slightly modified version of the yield (kg/ha) trial laid out as a split-plot design (Gomez & Gomez 1984). The trial had 4 genotypes (G), 6 nitrogen levels (N or n_amount) with 3 complete replicates (rep) and 6 incomplete blocks (mainplot) within each replicate.
## Rows: 72 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): rep, mainplot, G, N
## dbl (4): yield, row, col, n_amount
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
desplot(data = dat,
form = rep ~ col + row|rep, # fill color per rep, headers per rep
text = G, cex = 1, shorten = "no", # show genotype names per plot
col = N, # color of genotype names for each N-level
out1 = mainplot, out1.gpar = list(col = "black"), # lines between mainplots
out2 = row, out2.gpar = list(col = "darkgrey"), # lines between rows
main = "Field Layout", show.key = T, key.cex = 0.7) # formatting
mod_re <- lmerTest::lmer(yield ~ N * G + (1|rep) + (1|rep:mainplot), data = dat)
# (1|rep/mainplot) is equivalent syntax for the lmer function - glmmTMB may produce a different result
## boundary (singular) fit: see help('isSingular')
# The isSingular warning tells us that one of the random effects accounts for very little variation - since this random effect is part of the experimental design we will NOT drop it
summary(mod_re)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: yield ~ N * G + (1 | rep) + (1 | rep:mainplot)
## Data: dat
##
## REML criterion at convergence: 780.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8506 -0.5356 -0.1311 0.4778 1.9272
##
## Random effects:
## Groups Name Variance Std.Dev.
## rep:mainplot (Intercept) 5.086e+04 225.51172
## rep (Intercept) 3.070e-03 0.05541
## Residual 3.494e+05 591.13579
## Number of obs: 72, groups: rep:mainplot, 18; rep, 3
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 4252.67 365.28 45.78 11.642 2.79e-15 ***
## NN2 1419.33 516.59 45.78 2.748 0.00856 **
## NN3 2147.33 516.59 45.78 4.157 0.00014 ***
## NN4 2480.00 516.59 45.78 4.801 1.73e-05 ***
## NN5 3310.67 516.59 45.78 6.409 7.18e-08 ***
## NN6 4448.00 516.59 45.78 8.610 3.94e-11 ***
## GB 53.33 482.66 36.00 0.110 0.91263
## GC -1075.33 482.66 36.00 -2.228 0.03222 *
## GD 228.67 482.66 36.00 0.474 0.63853
## NN2:GB 256.67 682.58 36.00 0.376 0.70911
## NN3:GB -194.33 682.58 36.00 -0.285 0.77750
## NN4:GB 109.00 682.58 36.00 0.160 0.87402
## NN5:GB -666.00 682.58 36.00 -0.976 0.33572
## NN6:GB -2213.67 682.58 36.00 -3.243 0.00255 **
## NN2:GC 846.00 682.58 36.00 1.239 0.22321
## NN3:GC 669.33 682.58 36.00 0.981 0.33334
## NN4:GC 356.67 682.58 36.00 0.523 0.60451
## NN5:GC 199.33 682.58 36.00 0.292 0.77194
## NN6:GC -1560.00 682.58 36.00 -2.285 0.02828 *
## NN2:GD -1084.67 682.58 36.00 -1.589 0.12079
## NN3:GD -1816.67 682.58 36.00 -2.661 0.01155 *
## NN4:GD -3145.33 682.58 36.00 -4.608 4.95e-05 ***
## NN5:GD -5745.33 682.58 36.00 -8.417 5.01e-10 ***
## NN6:GD -7048.67 682.58 36.00 -10.326 2.62e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation matrix not shown by default, as p = 24 > 12.
## Use print(x, correlation=TRUE) or
## vcov(x) if you need it
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
## Groups Name Std.Dev.
## rep:mainplot (Intercept) 225.511719
## rep (Intercept) 0.055412
## Residual 591.135787
| | (Intercept) |
|---|---|
| rep1 | -1.45e-05 |
| rep2 | 2.29e-05 |
| rep3 | -8.30e-06 |
| | (Intercept) |
|---|---|
| rep1:mp1 | 9.811810 |
| rep1:mp2 | 63.224824 |
| rep1:mp3 | -218.067361 |
| rep1:mp4 | -28.055625 |
| rep1:mp5 | -153.708053 |
| rep1:mp6 | 86.282566 |
| rep2:mp10 | -7.297538 |
| rep2:mp11 | 207.335697 |
| rep2:mp12 | 170.756061 |
| rep2:mp7 | 116.821795 |
| rep2:mp8 | 7.297522 |
| rep2:mp9 | -116.269898 |
| rep3:mp13 | 47.311300 |
| rep3:mp14 | -179.280072 |
| rep3:mp15 | 36.886258 |
| rep3:mp16 | -2.514272 |
| rep3:mp17 | -70.522346 |
| rep3:mp18 | 29.987332 |
Here we have multiple random effects - replicate and mainplot within replicate.
| | F | Df | Df.res | Pr(>F) |
|---|---|---|---|---|
| (Intercept) | 135.538147 | 1 | 45.78314 | 0.0000000 |
| N | 17.621854 | 5 | 41.96547 | 0.0000000 |
| G | 3.016614 | 3 | 36.00000 | 0.0424007 |
| N:G | 13.235986 | 15 | 36.00000 | 0.0000000 |
withinG_mean_comparisons_tukey_re <- mod_re %>%
emmeans(specs = ~ N|G) %>%
cld(Letters = letters)
withinG_mean_comparisons_tukey_re
| | N | G | emmean | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|---|---|
| 1 | N1 | A | 4252.667 | 365.2839 | 45.78314 | 3517.294 | 4988.039 | a |
| 2 | N2 | A | 5672.000 | 365.2839 | 45.78314 | 4936.628 | 6407.372 | ab |
| 3 | N3 | A | 6400.000 | 365.2839 | 45.78314 | 5664.628 | 7135.372 | bc |
| 4 | N4 | A | 6732.667 | 365.2839 | 45.78314 | 5997.294 | 7468.039 | bc |
| 5 | N5 | A | 7563.333 | 365.2839 | 45.78314 | 6827.961 | 8298.706 | cd |
| 6 | N6 | A | 8700.667 | 365.2839 | 45.78314 | 7965.294 | 9436.039 | d |
| 7 | N1 | B | 4306.000 | 365.2839 | 45.78314 | 3570.628 | 5041.372 | a |
| 8 | N2 | B | 5982.000 | 365.2839 | 45.78314 | 5246.628 | 6717.372 | b |
| 9 | N3 | B | 6259.000 | 365.2839 | 45.78314 | 5523.628 | 6994.372 | b |
| 12 | N6 | B | 6540.333 | 365.2839 | 45.78314 | 5804.961 | 7275.706 | b |
| 10 | N4 | B | 6895.000 | 365.2839 | 45.78314 | 6159.628 | 7630.372 | b |
| 11 | N5 | B | 6950.667 | 365.2839 | 45.78314 | 6215.294 | 7686.039 | b |
| 13 | N1 | C | 3177.333 | 365.2839 | 45.78314 | 2441.961 | 3912.706 | a |
| 14 | N2 | C | 5442.667 | 365.2839 | 45.78314 | 4707.294 | 6178.039 | b |
| 15 | N3 | C | 5994.000 | 365.2839 | 45.78314 | 5258.628 | 6729.372 | b |
| 16 | N4 | C | 6014.000 | 365.2839 | 45.78314 | 5278.628 | 6749.372 | b |
| 18 | N6 | C | 6065.333 | 365.2839 | 45.78314 | 5329.961 | 6800.706 | b |
| 17 | N5 | C | 6687.333 | 365.2839 | 45.78314 | 5951.961 | 7422.706 | b |
| 24 | N6 | D | 1880.667 | 365.2839 | 45.78314 | 1145.294 | 2616.039 | a |
| 23 | N5 | D | 2046.667 | 365.2839 | 45.78314 | 1311.294 | 2782.039 | a |
| 22 | N4 | D | 3816.000 | 365.2839 | 45.78314 | 3080.628 | 4551.372 | b |
| 19 | N1 | D | 4481.333 | 365.2839 | 45.78314 | 3745.961 | 5216.706 | b |
| 21 | N3 | D | 4812.000 | 365.2839 | 45.78314 | 4076.628 | 5547.372 | b |
| 20 | N2 | D | 4816.000 | 365.2839 | 45.78314 | 4080.628 | 5551.372 | b |
withinG_mean_comparisons_tukey_re <- withinG_mean_comparisons_tukey_re %>%
as_tibble() %>%
mutate(N_G = paste0(N,"-",G)) %>%
mutate(N_G = fct_reorder(N_G, emmean))
dat <- dat %>%
mutate(N_G = paste0(N,"-",G)) %>%
mutate(N_G = fct_relevel(N_G, levels(withinG_mean_comparisons_tukey_re$N_G)))
withinG_RCBD_Plot_tukey <- ggplot() +
facet_wrap(~ G, labeller = label_both) + # facet per G level
geom_point(data = dat, aes(y = yield, x = N, color = N)) + # dots representing the raw data
geom_point(data = withinG_mean_comparisons_tukey_re, aes(y = emmean, x = N), color = "red", position = position_nudge(x = 0.1)) + # red dots representing the adjusted means
geom_errorbar(data = withinG_mean_comparisons_tukey_re, aes(ymin = lower.CL, ymax = upper.CL, x = N), color = "red", width = 0.1, position = position_nudge(x = 0.1)) + # red error bars representing the confidence limits of the adjusted means
geom_text(data = withinG_mean_comparisons_tukey_re, aes(y = emmean, x = N, label = str_trim(.group)), color = "red", position = position_nudge(x = 0.2), hjust = 0) + # red letters
scale_y_continuous(name = "Yield", limits = c(0, NA), expand = expansion(mult = c(0, 0.1))) +
scale_x_discrete(name = NULL) +
theme_classic() + # clearer plot format
labs(caption = str_wrap("Black dots represent raw data. Red dots and error bars represent estimated marginal means ± 95% confidence interval per group. Means not sharing any letter are significantly different by the t-test at the 5% level of significance.", width = 120)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1), legend.position = "bottom")
withinG_RCBD_Plot_tukey
Mixed effects models become even further complicated when you consider random slopes in addition to random intercepts. The model specification details found at this link can help you draft the syntax to represent the random effects of your model.
These data come from the rikz dataset of 45 intertidal sites across 9 beaches in the Netherlands. We first want to model the change in richness of species along the position of the site relative to mean sea level (NAP) and a random variation of the intercept between the beaches (linear mixed effect model with a random intercept). We can also consider a model where the effect of the NAP on the response varies from one beach to another (linear mixed effect model with a random slope and intercept).
## Rows: 45 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): Richness, Exposure, NAP, Beach, Site
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## spc_tbl_ [45 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Richness: num [1:45] 11 10 13 11 10 8 9 8 19 17 ...
## $ Exposure: num [1:45] 10 10 10 10 10 8 8 8 8 8 ...
## $ NAP : num [1:45] 0.045 -1.036 -1.336 0.616 -0.684 ...
## $ Beach : num [1:45] 1 1 1 1 1 2 2 2 2 2 ...
## $ Site : num [1:45] 1 2 3 4 5 1 2 3 4 5 ...
## - attr(*, "spec")=
## .. cols(
## .. Richness = col_double(),
## .. Exposure = col_double(),
## .. NAP = col_double(),
## .. Beach = col_double(),
## .. Site = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
ggplot(data, aes(y = Richness, x = NAP)) +
geom_point() +
xlab("NAP") +
ylab("Richness") +
theme_classic(base_size = 15) +
stat_smooth(method = "lm", formula = 'y ~ x', se = F, fullrange = T) +
facet_wrap(~ Beach)
With especially complicated models it is common to have convergence issues. The following link can help troubleshoot issues with lmer type models based on the warning/error messages printed after fitting a model.
mod_rint <- lmer(Richness ~ NAP + (1|Beach), data, REML = T)
mod_rsint <- lmer(Richness ~ NAP + (1 + NAP|Beach), data, REML = T)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00267151 (tol = 0.002, component 1)
# Restarting from previous fit and increased max number of iterations
ss <- getME(mod_rsint, c("theta", "fixef"))
mod_rsint2 <- update(mod_rsint, start = ss, control = lmerControl(optCtrl = list(maxfun = 2e4)), REML = T)
summary(mod_rsint2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Richness ~ NAP + (1 + NAP | Beach)
## Data: data
## Control: lmerControl(optCtrl = list(maxfun = 20000))
##
## REML criterion at convergence: 232.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8211 -0.3411 -0.1675 0.1924 3.0400
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Beach (Intercept) 12.596 3.549
## NAP 2.940 1.715 -0.99
## Residual 7.307 2.703
## Number of obs: 45, groups: Beach, 9
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 6.5885 1.2648 5.209
## NAP -2.8300 0.7229 -3.915
##
## Correlation of Fixed Effects:
## (Intr)
## NAP -0.819
# Compare random effects structure to see which fits the data best
AICcmodavg::aictab(
cand.set = list(mod_rint, mod_rsint2),
modnames = c("Intercept", "Slope+Intercept"),
second.ord = FALSE) # get AIC instead of AICc
## Warning in aictab.AIClmerMod(cand.set = list(mod_rint, mod_rsint2), modnames = c("Intercept", :
## Model selection for fixed effects is only appropriate with ML estimation:
## REML (default) should only be used to select random effects for a constant set of fixed effects
| | Modnames | K | AIC | Delta_AIC | ModelLik | AICWt | Res.LL | Cum.Wt |
|---|---|---|---|---|---|---|---|---|
| 2 | Slope+Intercept | 6 | 244.3839 | 0.000000 | 1.0000000 | 0.824652 | -116.1919 | 0.824652 |
| 1 | Intercept | 4 | 247.4802 | 3.096378 | 0.2126327 | 0.175348 | -119.7401 | 1.000000 |
# Residual plot looks problematic - check omitted variable bias and include Exposure in model
mod_rsint1 <- lmer(Richness ~ NAP + (1 + NAP|Beach), data, REML = F)
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
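The chunk that produced the comparison below is not shown in the original. Judging by the parameter counts (6 vs 10) and the comment that follows it, it presumably refit a model with the NAP * Exposure interaction under ML and compared the two with `anova()` (a reconstruction, not the original code):

```r
# Hypothetical reconstruction - the original chunk was not shown
mod_rsint2 <- lmer(Richness ~ NAP * Exposure + (1 + NAP | Beach), data, REML = F)
anova(mod_rsint1, mod_rsint2)
```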
| | npar | AIC | BIC | logLik | -2*log(L) | Chisq | Df | Pr(>Chisq) |
|---|---|---|---|---|---|---|---|---|
| mod_rsint1 | 6 | 246.6561 | 257.496 | -117.3280 | 234.6561 | NA | NA | NA |
| mod_rsint2 | 10 | 238.1993 | 256.266 | -109.0997 | 218.1993 | 16.45673 | 4 | 0.0024637 |
# The model with the interaction fixed effect fits better - REML = F must be used when testing fixed effects
plot(residuals(mod_rsint2) ~ fitted(mod_rsint2))
abline(h = 0)
# Didn't resolve the heteroscedasticity in the residuals - try a count model
mod_glm_rsint <- glmer(Richness ~ NAP * Exposure + (1 + NAP|Beach), data, family = poisson(link = "log"))
plot(residuals(mod_glm_rsint) ~ fitted(mod_glm_rsint))
abline(h = 0)
## # Overdispersion test
##
## dispersion ratio = 0.895
## Pearson's Chi-Squared = 32.228
## p-value = 0.649
## No overdispersion detected.
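The overdispersion test above matches the output format of the performance package (an assumption about the omitted chunk):

```r
# Hypothetical reconstruction - the original chunk was not shown
performance::check_overdispersion(mod_glm_rsint)
```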
# Selected model steps
# 1) determine random effects structure using REML=T
# 2) determine fixed effects structure using REML=F
# 3) determine the appropriate distribution (normal, transformation, count, etc.)
# 4) move onto assessing Anova results
car::Anova(mod_glm_rsint, type = 3) # cannot call test.statistic = "F" for glmerMod object
| | Chisq | Df | Pr(>Chisq) |
|---|---|---|---|
| (Intercept) | 148.454216 | 1 | 0.0000000 |
| NAP | 1.316567 | 1 | 0.2512092 |
| Exposure | 29.587327 | 2 | 0.0000004 |
| NAP:Exposure | 1.573073 | 2 | 0.4554195 |
| | Chisq | Df | Pr(>Chisq) |
|---|---|---|---|
| NAP | 24.231343 | 1 | 0.0000009 |
| Exposure | 30.890508 | 2 | 0.0000002 |
| NAP:Exposure | 1.573073 | 2 | 0.4554195 |
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: Richness ~ NAP * Exposure + (1 + NAP | Beach)
## Data: data
##
## AIC BIC logLik -2*log(L) df.resid
## 212.7 228.9 -97.3 194.7 36
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.4268 -0.5756 -0.1351 0.1743 2.0062
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Beach (Intercept) 0.02686 0.1639
## NAP 0.05272 0.2296 -0.22
## Number of obs: 45, groups: Beach, 9
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.5349 0.2080 12.184 < 2e-16 ***
## NAP -0.3037 0.2646 -1.147 0.2512
## Exposure10 -0.5702 0.2447 -2.330 0.0198 *
## Exposure11 -1.3585 0.2613 -5.200 1.99e-07 ***
## NAP:Exposure10 -0.3969 0.3166 -1.254 0.2100
## NAP:Exposure11 -0.2685 0.3301 -0.813 0.4160
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) NAP Exps10 Exps11 NAP:E10
## NAP -0.151
## Exposure10 -0.851 0.128
## Exposure11 -0.795 0.121 0.675
## NAP:Expsr10 0.127 -0.837 -0.124 -0.100
## NAP:Expsr11 0.122 -0.801 -0.110 -0.117 0.673
Now that the best fitting model has been chosen, we can move forward with interpreting the results.
Exposure_means <- mod_glm_rsint %>%
emmeans(~ Exposure) %>%
cld(Letters = letters, type = "response")
## NOTE: Results may be misleading due to involvement in interactions
| | Exposure | rate | SE | df | asymp.LCL | asymp.UCL | .group |
|---|---|---|---|---|---|---|---|
| 3 | 11 | 2.657518 | 0.4496860 | Inf | 1.907401 | 3.702631 | a |
| 2 | 10 | 5.590957 | 0.7752359 | Inf | 4.260489 | 7.336903 | b |
| 1 | 8 | 11.351010 | 2.4336616 | Inf | 7.456526 | 17.279553 | c |
trends <- mod_glm_rsint %>%
emtrends(pairwise ~ Exposure, var = "NAP") %>%
cld(Letters = letters, type = "response")
trends
| | Exposure | NAP.trend | SE | df | asymp.LCL | asymp.UCL | .group |
|---|---|---|---|---|---|---|---|
| 2 | 10 | -0.7005764 | 0.1734339 | Inf | -1.0405006 | -0.3606523 | a |
| 3 | 11 | -0.5722070 | 0.1976221 | Inf | -0.9595392 | -0.1848748 | a |
| 1 | 8 | -0.3036633 | 0.2646494 | Inf | -0.8223666 | 0.2150400 | a |
We need to produce a trend line for each exposure level. We will cover two approaches that are more complicated than the previously used predict function.
## Richness Exposure NAP Beach Site
## Min. : 0.000 8 : 5 Min. :-1.3360 1 : 5 Min. :1
## 1st Qu.: 3.000 10:20 1st Qu.:-0.3750 2 : 5 1st Qu.:2
## Median : 4.000 11:20 Median : 0.1670 3 : 5 Median :3
## Mean : 5.689 Mean : 0.3477 4 : 5 Mean :3
## 3rd Qu.: 8.000 3rd Qu.: 1.1170 5 : 5 3rd Qu.:4
## Max. :22.000 Max. : 2.2550 6 : 5 Max. :5
## (Other):15
data$Site <- as.factor(data$Site)
new.x <- expand.grid(NAP = seq(-1.336, 2.255, length.out = 100),
Exposure = levels(data$Exposure),
Beach = levels(data$Beach),
Site = levels(data$Site))
new.x$Richness <- predict(mod_glm_rsint, new.x, re.form = NA, type = "response")
# predict() for merMod objects does not return standard errors, so confidence
# intervals cannot be derived directly from the predict function
# Matrix multiplication to produce approximate 95% confidence intervals
mm <- model.matrix(terms(mod_glm_rsint), new.x)
pred_link <- mm %*% fixef(mod_glm_rsint) # predictions on the link (log) scale
pvar1 <- diag(mm %*% tcrossprod(vcov(mod_glm_rsint), mm)) # fixed-effects variance
tvar1 <- pvar1 + VarCorr(mod_glm_rsint)$Beach[1] # plus the random-intercept variance
cmult <- 2 # multiplier for an approximate 95% interval
new.x <- data.frame(new.x
                    , plo = exp(pred_link - cmult * sqrt(pvar1))
                    , phi = exp(pred_link + cmult * sqrt(pvar1))
                    , tlo = exp(pred_link - cmult * sqrt(tvar1))
                    , thi = exp(pred_link + cmult * sqrt(tvar1)))
# Confidence intervals based on the fixed effects only
FE_Plot <- ggplot(data, aes(x = NAP, y = Richness, color = Exposure)) +
  geom_point(size = 5) +
  geom_line(data = new.x, aes(x = NAP, y = Richness, color = Exposure)) +
  geom_line(data = new.x, aes(x = NAP, y = plo, color = Exposure), lty = 2) +
  geom_line(data = new.x, aes(x = NAP, y = phi, color = Exposure), lty = 2) +
  scale_color_manual(values = c(`8` = "blue", `10` = "red", `11` = "green")) +
  theme_classic()
FE_Plot
# Confidence intervals based on the fixed and random effects
FE_RE_Plot <- ggplot(data, aes(x = NAP, y = Richness, color = Exposure)) +
  geom_point(size = 5) +
  geom_line(data = new.x, aes(x = NAP, y = Richness, color = Exposure)) +
  geom_line(data = new.x, aes(x = NAP, y = tlo, color = Exposure), lty = 2) +
  geom_line(data = new.x, aes(x = NAP, y = thi, color = Exposure), lty = 2) +
  scale_color_manual(values = c(`8` = "blue", `10` = "red", `11` = "green")) +
  theme_classic()
FE_RE_Plot
Exposure_means <- Exposure_means %>%
as_tibble()
Plot_Exposure <- ggplot() +
geom_point(data = data, aes(y = Richness, x = Exposure), position = position_jitter(width = 0.1)) + # dots representing the raw data
geom_boxplot(data = data, aes(y = Richness, x = Exposure), position = position_nudge(x = -0.25), width = 0.25, outlier.shape = NA) + # boxplot
geom_point(data = Exposure_means, aes(y = rate, x = Exposure), position = position_nudge(x = 0.15), size = 2, color = "red") + # red dots representing the adjusted means
geom_errorbar(data = Exposure_means, aes(ymin = asymp.LCL, ymax = asymp.UCL, x = Exposure), position = position_nudge(x = 0.15), color = "red", width = 0.1) + # red error bars representing the confidence limits of the adjusted means
geom_text(data = Exposure_means, aes(y = rate, x = Exposure, label = str_trim(.group)), position = position_nudge(x = 0.25), color = "black", angle = 0) + # compact letter display
labs(y = "Richness", x = "Exposure", tag = "C") +
theme_classic()
Plot_Exposure
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's fill values.
## data's fill values.
## No shared levels found between `names(values)` of the manual scale and the
## data's fill values.
new_x <- expand.grid(NAP = seq(-1.336, 2.255, length.out = 100),
Exposure = levels(data$Exposure),
Beach = levels(data$Beach),
Site = levels(data$Site))
mySumm <- function(.) {
  predict(., newdata = new_x, re.form = NA)  # fixed-effects-only predictions on the new grid
}
sumBoot <- function(merBoot) {
return(
data.frame(fit = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = 0.5, na.rm = TRUE))),
lwr = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = 0.025, na.rm = TRUE))),
upr = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = 0.975, na.rm = TRUE)))
)
)
}
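The sumBoot helper simply takes column-wise percentiles of the bootstrap replicates stored in the `$t` slot. Its logic can be checked on a toy replicate matrix of standard normal draws (a stand-in for a real bootMer result), where the median should land near 0 and the limits near ±1.96:

```r
set.seed(1)
# 1000 fake bootstrap replicates of 2 predicted quantities
t_star <- matrix(rnorm(2000), nrow = 1000)
fake_boot <- list(t = t_star)  # mimics the $t slot of a bootMer object

# Column-wise 50th / 2.5th / 97.5th percentiles, exactly as sumBoot computes them
ci <- data.frame(
  fit = apply(fake_boot$t, 2, quantile, probs = 0.5,   na.rm = TRUE),
  lwr = apply(fake_boot$t, 2, quantile, probs = 0.025, na.rm = TRUE),
  upr = apply(fake_boot$t, 2, quantile, probs = 0.975, na.rm = TRUE))
ci
```

Each row of the result corresponds to one row of the prediction grid, which is why the output can be bound back onto `new_x` below.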
boot1 <- lme4::bootMer(mod_glm_rsint, mySumm, nsim = 250, use.u = FALSE, type = "parametric")
PI.boot1 <- sumBoot(boot1)
head(PI.boot1)
| fit | lwr | upr |
|---|---|---|
| 3.002342 | 1.848330 | 3.731325 |
| 2.989927 | 1.856632 | 3.704538 |
| 2.976820 | 1.864935 | 3.677143 |
| 2.961818 | 1.873238 | 3.648247 |
| 2.946816 | 1.880132 | 3.620555 |
| 2.932186 | 1.886986 | 3.593678 |
new_data <- data.frame(new_x, PI.boot1) %>%
rename(Richness = fit) %>%
mutate(Richness = exp(Richness),
lwr = exp(lwr),
upr = exp(upr))
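Because the Poisson GLMM works on the log link, bootMer returned fits on the link scale, and the point estimate and both limits are exponentiated together; the ordering of the limits survives because exp() is monotone. A quick check with link-scale values close to the first row of PI.boot1 above:

```r
# Link-scale fit and limits (illustrative, roughly the first bootstrap row above)
link_fit <- 3.002342
link_lwr <- 1.848330
link_upr <- 3.731325

# Back-transform all three with exp(); monotonicity preserves lwr < fit < upr
resp <- exp(c(lwr = link_lwr, fit = link_fit, upr = link_upr))
resp  # response-scale (Richness) values
```

Note that the interval is no longer symmetric around the fit after back-transformation, which is expected on the response scale of a log-link model.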
# use.u = FALSE above: new random effects are simulated each iteration, so intervals include random effect uncertainty
FE_RE_Plot <- ggplot(data, aes(x = NAP, y = Richness, color = Exposure)) +
geom_point(size = 2, alpha = 0.5) +
geom_line(data = new_data, aes(x = NAP, y = Richness, color = Exposure), linewidth = 2) +
  geom_line(data = new_data, aes(x = NAP, y = lwr, color = Exposure), lty = 2) +
  geom_line(data = new_data, aes(x = NAP, y = upr, color = Exposure), lty = 2) +
scale_color_manual(values = c(`8` = "blue", `10` = "red", `11` = "green")) +
scale_fill_manual(values = c(`8` = "blue", `10` = "red", `11` = "green")) +
theme_classic()
FE_RE_Plot
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's fill values.
## data's fill values.
boot2 <- lme4::bootMer(mod_glm_rsint, mySumm, nsim = 250, use.u = TRUE, type = "parametric")
PI.boot2 <- sumBoot(boot2)
new_data_2 <- data.frame(new_x, PI.boot2) %>%
rename(Richness = fit) %>%
mutate(Richness = exp(Richness),
lwr = exp(lwr),
upr = exp(upr))
# use.u = TRUE: intervals condition on the estimated random effects
FE_RE_Plot_2 <- ggplot(data, aes(x = NAP, y = Richness, color = Exposure)) +
geom_point(size = 5) +
geom_line(data = new_data_2, aes(x = NAP, y = Richness, color = Exposure)) +
  geom_line(data = new_data_2, aes(x = NAP, y = lwr, color = Exposure), lty = 2) +
  geom_line(data = new_data_2, aes(x = NAP, y = upr, color = Exposure), lty = 2) +
scale_color_manual(values = c(`8` = "blue", `10` = "red", `11` = "green")) +
scale_fill_manual(values = c(`8` = "blue", `10` = "red", `11` = "green")) +
theme_classic()
FE_RE_Plot_2
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's fill values.
## data's fill values.
This type of analysis is sometimes called a longitudinal study, where repeated observations are taken on individual subjects. In agricultural research, an individual plant may have repeated observations. In this example dataset, eighteen subjects participated in a study in which they were allowed only three hours of sleep per night, and their reaction time on a specific test was recorded each day. The underlying correlation structure means that observations from the same subject should be more similar than observations from two different subjects. We begin with an initial review of the dataset and then fit a random intercept model.
* Reaction - average reaction time (ms)
* Days - number of days of sleep deprivation
* Subject - subject number on which the observation was made
## Rows: 180 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Reaction, Days, Subject
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## spc_tbl_ [180 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Reaction: num [1:180] 250 259 251 321 357 ...
## $ Days : num [1:180] 0 1 2 3 4 5 6 7 8 9 ...
## $ Subject : num [1:180] 308 308 308 308 308 308 308 308 308 308 ...
## - attr(*, "spec")=
## .. cols(
## .. Reaction = col_double(),
## .. Days = col_double(),
## .. Subject = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
data$Subject <- as.factor(data$Subject)
ggplot(data, aes(y = Reaction, x = Days)) +
facet_wrap(~ Subject, ncol = 6) +
geom_point() +
  geom_line()
mod_intercept <- lmer(Reaction ~ Days + (1|Subject), data = data)
# Review model assumptions
plot(resid(mod_intercept) ~ fitted(mod_intercept)) # small hint of curved relationship
abline(h = 0)
# Review contribution of each random effect and their predictors (BLUPs)
summary(mod_intercept)$varcor
## Groups   Name        Std.Dev.
## Subject (Intercept) 37.124
## Residual 30.991
| | (Intercept) |
|---|---|
| 308 | 40.783710 |
| 309 | -77.849554 |
| 310 | -63.108567 |
| 330 | 4.406442 |
| 331 | 10.216189 |
| 332 | 8.221238 |
| 333 | 16.500494 |
| 334 | -2.996981 |
| 335 | -45.282127 |
| 337 | 72.182686 |
| 349 | -21.196249 |
| 350 | 14.111363 |
| 351 | -7.862221 |
| 352 | 36.378425 |
| 369 | 7.036381 |
| 370 | -6.362703 |
| 371 | -3.294273 |
| 372 | 18.115747 |
car::Anova(mod_intercept, test.statistic = "F") # Days is a continuous variable - the summary table gives us the slope of the line
| | F | Df | Df.res | Pr(>F) |
|---|---|---|---|---|
| Days | 169.4014 | 1 | 161 | 0 |
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (1 | Subject)
## Data: data
##
## REML criterion at convergence: 1786.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2257 -0.5529 0.0109 0.5188 4.2506
##
## Random effects:
## Groups Name Variance Std.Dev.
## Subject (Intercept) 1378.2 37.12
## Residual 960.5 30.99
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.4051 9.7467 25.79
## Days 10.4673 0.8042 13.02
##
## Correlation of Fixed Effects:
## (Intr)
## Days -0.371
Each subject has their own baseline reaction time, and the subsequent measurements are relative to that baseline, so a random intercept gives each subject a unique baseline prediction. To visualize how well this model fits the data, we will plot the predicted values: lines whose y-intercepts equal the sum of the fixed intercept and each subject's random intercept. The slope is assumed to be the same for every subject, 10.4673.
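The per-subject line arithmetic can be checked by hand with the estimates reported above: the fixed intercept (251.4051), the shared fixed slope (10.4673), and a random intercept from the BLUP table (40.78371 for Subject 308):

```r
# Values taken from the model output and BLUP table above
fixed_intercept <- 251.4051
fixed_slope     <- 10.4673
blup_308        <- 40.78371  # random intercept for Subject 308

# Fitted line for Subject 308: shifted intercept, shared slope
days <- 0:9
yhat_308 <- (fixed_intercept + blup_308) + fixed_slope * days
round(yhat_308, 1)
```

This is exactly what `predict(mod_intercept, re.form = ~ (1|Subject))` computes for Subject 308's rows in the next chunk.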
data <- data %>%
  mutate(yhat = predict(mod_intercept, re.form = ~ (1|Subject))) # predictions from the fixed effects plus each subject's random intercept
ggplot(data, aes(y = Reaction, x = Days)) +
facet_wrap(~ Subject, ncol = 6) +
geom_point() +
geom_line() + # original lines from raw data
  geom_line(aes(y = yhat), color = 'red') # predicted lines from yhat values
Some subjects deviate less from their predicted lines, but this model assumes every subject has the same slope. We can fit a model that allows each subject to have their own slope as well as their own y-intercept. Each subject's slope is then the fixed effect slope plus a random offset from it.
mod_SI <- lmer(Reaction ~ Days + (1 + Days|Subject), data = data)
# Review model assumptions
plot(resid(mod_SI) ~ fitted(mod_SI)) # the curved pattern is gone, though a few extreme observations remain
abline(h = 0)
## Groups   Name        Std.Dev. Corr
## Subject (Intercept) 24.7407
## Days 5.9221 0.066
## Residual 25.5918
| | (Intercept) | Days |
|---|---|---|
| 308 | 2.2585509 | 9.1989758 |
| 309 | -40.3987381 | -8.6196806 |
| 310 | -38.9604090 | -5.4488565 |
| 330 | 23.6906196 | -4.8143503 |
| 331 | 22.2603126 | -3.0699116 |
| 332 | 9.0395679 | -0.2721770 |
| 333 | 16.8405086 | -0.2236361 |
| 334 | -7.2326151 | 1.0745816 |
| 335 | -0.3336684 | -10.7521652 |
| 337 | 34.8904868 | 8.6282652 |
| 349 | -25.2102286 | 1.1734322 |
| 350 | -13.0700342 | 6.6142178 |
| 351 | 4.5778642 | -3.0152621 |
| 352 | 20.8636782 | 3.5360011 |
| 369 | 3.2754656 | 0.8722149 |
| 370 | -25.6129993 | 4.8224850 |
| 371 | 0.8070461 | -0.9881562 |
| 372 | 12.3145921 | 1.2840221 |
car::Anova(mod_SI, test.statistic = "F") # Days is a continuous variable - the summary table gives us the slope of the line
| | F | Df | Df.res | Pr(>F) |
|---|---|---|---|---|
| Days | 45.85296 | 1 | 17 | 3.3e-06 |
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (1 + Days | Subject)
## Data: data
##
## REML criterion at convergence: 1743.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9536 -0.4634 0.0231 0.4634 5.1793
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Subject (Intercept) 612.10 24.741
## Days 35.07 5.922 0.07
## Residual 654.94 25.592
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.405 6.825 36.838
## Days 10.467 1.546 6.771
##
## Correlation of Fixed Effects:
## (Intr)
## Days -0.138
Review figure of each subject with their unique slope and intercept
data <- data %>%
mutate(yhat = predict(mod_SI, re.form = ~ (1 + Days|Subject)))
ggplot(data, aes(y = Reaction, x = Days)) +
facet_wrap(~ Subject, ncol = 6) +
geom_point() +
geom_line() +
  geom_line(aes(y = yhat), color = 'red')
Visual inspection suggests the random slope and intercept prediction lines fit the data better. We can apply a formal test to compare the fit of the two models (random intercept versus random slope + intercept).
## refitting model(s) with ML (instead of REML)
| | npar | AIC | BIC | logLik | -2*log(L) | Chisq | Df | Pr(>Chisq) |
|---|---|---|---|---|---|---|---|---|
| mod_intercept | 4 | 1802.079 | 1814.850 | -897.0393 | 1794.079 | NA | NA | NA |
| mod_SI | 6 | 1763.939 | 1783.097 | -875.9697 | 1751.939 | 42.1393 | 2 | 0 |
The lower AIC and BIC values and the higher (less negative) log-likelihood value tells us that the random slope and intercept model is a better model than just a random intercept model.
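The likelihood-ratio numbers in the table can be reproduced by hand: the chi-square statistic is the drop in -2*log(L) between the ML-refit models, with degrees of freedom equal to the difference in parameter counts:

```r
# Deviance (-2*log(L)) values from the comparison table above
dev_intercept <- 1794.079  # random intercept model, 4 parameters
dev_SI        <- 1751.939  # random slope + intercept model, 6 parameters

chisq <- dev_intercept - dev_SI        # close to the Chisq column (display rounding)
df    <- 6 - 4                         # difference in npar
p     <- pchisq(chisq, df = df, lower.tail = FALSE)
c(chisq = chisq, df = df, p = p)
```

The tiny p-value is why the table reports Pr(>Chisq) as 0; testing a variance component on its boundary makes this test conservative, so the conclusion that the slope variance is needed still stands.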
We can review the results from the better fitting model with first the population estimate for the relationship between Days and Reaction time. Then the estimates for each subject.
data <- data %>%
mutate(yhat = predict(mod_SI, re.form = ~ 0))
ggplot(data, aes(x = Days, y = yhat)) +
geom_point(aes(x = Days, y = Reaction)) +
geom_line(color = 'red') + ylab('Reaction') +
ggtitle('Population Estimated Regression Curve') +
  scale_x_continuous(breaks = seq(0, 9, by = 2))
data <- data %>%
mutate(yhat.ind = predict(mod_SI, re.form = ~ (1 + Days|Subject)))
ggplot(data, aes(x = Days)) +
geom_line(aes(y = yhat), linewidth = 3) +
  geom_line(aes(y = yhat.ind, group = Subject), color = 'red') +
  scale_x_continuous(breaks = seq(0, 9, by = 2)) +
  ylab('Reaction') + ggtitle('Person-to-Person Variation')
The final step in producing an explanatory figure is to add confidence intervals around the predicted relationship. A familiar approach to producing confidence intervals around the prediction line uses the predict function again. This approach makes it easy to include other variables from more complex models; those terms can simply be added to the expand.grid call.
# Find the range of observed Days values
min.days <- min(data$Days)
max.days <- max(data$Days)
# New x data
new.x <- expand.grid(Days = seq(min.days, max.days, length = 1000), Subject = levels(data$Subject))
# Generate fits and standard errors at new.x values
new.y <- predict(mod_SI, newdata = new.x, se.fit = TRUE, re.form = NA)
new.y <- data.frame(new.y)
# housekeeping to put new.x and new.y together
addThese <- data.frame(new.x, new.y)
addThese <- rename(addThese, Reaction = fit)
# Add confidence intervals
addThese <- mutate(addThese, lwr = Reaction -1.96 * se.fit,
upr = Reaction + 1.96 * se.fit)
# See how the confidence intervals match the raw data
ggplot(data, aes(x = Days, y = Reaction)) +
geom_point(size = 3, alpha = 0.5) +
geom_smooth(data = addThese, aes(ymin = lwr, ymax = upr), stat = "identity") +
  theme_classic()
Unfortunately, there isn't a straightforward way to incorporate the uncertainty of the variance components this way. Instead we rely on bootstrapping techniques (an iterative process) to produce these quantities. This can be achieved using a function in the lme4 package following these steps:
First we need to find out some details of the sample population
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (1 + Days | Subject)
## Data: data
##
## REML criterion at convergence: 1743.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9536 -0.4634 0.0231 0.4634 5.1793
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Subject (Intercept) 612.10 24.741
## Days 35.07 5.922 0.07
## Residual 654.94 25.592
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.405 6.825 36.838
## Days 10.467 1.546 6.771
##
## Correlation of Fixed Effects:
## (Intr)
## Days -0.138
# Estimated intercept of 251.405
# Estimated slope of 10.467
# Subject intercept and slope random effects are assumed normally distributed, centered at zero, with estimated standard deviations of 24.741 and 5.922 (respectively)
# Given a subject's regression line, observations are normally distributed deviations (mean zero, standard deviation 25.592) from that line.
# Creating bootstrap data
subject.intercept = 251.405 + rnorm(1, mean = 0, sd = 24.741)
subject.slope = 10.467 + rnorm(1, mean = 0, sd = 5.922)
c(subject.intercept, subject.slope)
## [1] 283.70730  12.78199
subject.obs <- data.frame(Days = 0:9) %>%
mutate(Reaction = subject.intercept + subject.slope * Days + rnorm(10, sd=25.592))
ggplot(subject.obs, aes(x = Days, y = Reaction)) +
  geom_point()
This approach is commonly referred to as a "parametric" bootstrap because we are making assumptions about the parameter distributions, whereas a "non-parametric" bootstrap makes no distributional assumptions. By default, the bootMer function uses a parametric bootstrap (type = "parametric") and simulates new random effects on each iteration (use.u = FALSE).
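The simulate-and-refit loop that bootMer automates can be sketched by hand for an ordinary linear model (no random effects, so plain lm stands in for lmer): draw new responses from the fitted model, refit, and collect the statistic of interest:

```r
set.seed(42)
# A simple dataset with known structure (slope 0.5)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30, sd = 1)
fit <- lm(y ~ x)

# Parametric bootstrap: simulate responses from the fitted model, refit, keep the slope
slope_star <- replicate(500, {
  y_star <- simulate(fit)[[1]]   # new responses drawn from the fitted model
  coef(lm(y_star ~ x))[["x"]]    # refit and extract the slope
})

# Percentile confidence interval for the slope
quantile(slope_star, probs = c(0.025, 0.975))
```

bootMer does the same thing for a merMod object, with the extra wrinkle that simulation can either draw new random effects (use.u = FALSE) or condition on the estimated ones (use.u = TRUE).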
ConfData <- data.frame(Days = 0:9) # the x-values of interest
# With multiple variables in the model this is more involved: build a grid of
# those variables, holding each at a chosen value (e.g., mean or median)
# Get our best guess as to the relationship between day and reaction time
ConfData <- ConfData %>%
mutate( Estimate = predict(mod_SI, newdata = ConfData, re.form = ~ 0))
# A function to generate yhat from a model
myStats <- function(model.star){
out <- predict(model.star, newdata = ConfData, re.form = ~ 0)
return(out)
}
# bootMer generates new data sets, calls lmer on each to produce a model,
# and then calls whatever function I pass in. It repeats this `nsim` times.
bootObj <- lme4::bootMer(mod_SI, FUN = myStats, nsim = 1000 )
# Unfortunately, it doesn't allow for the bias-corrected and accelerated (BCa) method, but the percentile method is ok for visualization.
hist(bootObj, ci = 'perc')
# Add the confidence interval values onto the estimates data frame
CI <- confint(bootObj, level = 0.95, ci = 'bca') # here we can get the bias-corrected and accelerated option
colnames(CI) <- c('lwr', 'upr')
ConfData <- cbind(ConfData, CI)
ConfData %>%
ggplot(aes(x = Days)) +
geom_line(aes(y = Estimate), color = 'red') +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = 'salmon', alpha = 0.2)
A prediction interval describes the range in which new observations are expected to fall, rather than the range of the mean response. We can visualize the prediction interval in relation to the confidence interval. The simulate function creates the bootstrap dataset without sending it on for further processing: it returns response values organized so they can be appended directly to the original dataset.
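The relationship between the two interval types can be demonstrated with a small simulation: a prediction interval stacks observation-level noise on top of uncertainty in the mean, so it is always wider than the confidence interval. The numbers below are illustrative stand-ins (a mean near the fitted intercept and the residual SD from the model above), not model output:

```r
set.seed(7)
n_sim  <- 10000
mu_hat <- 300    # illustrative estimated mean response
se_mu  <- 7      # illustrative standard error of that mean
sigma  <- 25.6   # residual SD, roughly the value reported for mod_SI

# Confidence interval: uncertainty in the mean only
mean_draws <- rnorm(n_sim, mu_hat, se_mu)
# Prediction interval: mean uncertainty plus observation-level noise
obs_draws  <- rnorm(n_sim, mean_draws, sigma)

ci_band <- quantile(mean_draws, c(0.025, 0.975))
pi_band <- quantile(obs_draws,  c(0.025, 0.975))
rbind(confidence = ci_band, prediction = pi_band)  # prediction band is much wider
```

This is the same logic the simulate() call below applies per day, using the fitted model's actual variance components.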
# set up the structure of new subjects
PredData <- data.frame(Subject = 'new', Days = 0:9) # Simulate a NEW patient
# Create a n x 1000 data frame
Simulated <- simulate(mod_SI, newdata = PredData, allow.new.levels = TRUE, nsim = 1000)
# squish the Subject/Day info together with the simulated and then grab the quantiles
# for each day
PredIntervals <- cbind(PredData, Simulated) %>%
gather('sim', 'Reaction', sim_1:sim_1000 ) %>% # go from wide to long structure
group_by(Subject, Days) %>%
summarize(lwr = quantile(Reaction, probs = 0.025),
            upr = quantile(Reaction, probs = 0.975))
## `summarise()` has grouped output by 'Subject'. You can override using the
## `.groups` argument.
# Plot the prediction and confidence intervals
ggplot(ConfData, aes(x = Days)) +
geom_point(data = data, aes(x = Days, y = Reaction)) +
geom_line(aes(y = Estimate), color = 'red') +
geom_ribbon(aes(ymin = lwr, ymax = upr), fill = 'salmon', alpha = 0.2) +
geom_ribbon(data = PredIntervals, aes(ymin = lwr, ymax = upr), fill = 'blue', alpha = 0.2)